HW Assignment 11 - Data Competition

k-Nearest-Neighbors (k-NN), Naive Bayes, Decision Trees, Random Forests, SVC, and NN MLP

IST 5520: Data Science and Machine Learning with Python

By: Austin Funcheon and Viraj Rane

The contribution of each member is noted in every section of this notebook.

1. Data

1.1 Import Data

Contribution: Austin, Viraj

Import train_data.csv file as training data.
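A sketch of what this import step might look like. The column names below are placeholders (the real train_data.csv schema is not shown here), and a tiny inline CSV stands in for the file so the snippet is self-contained; in the notebook this would simply be `pd.read_csv("train_data.csv")`.

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-in for train_data.csv (column names are assumptions);
# in the notebook: train_df = pd.read_csv("train_data.csv")
csv_text = """feature1,feature2,target
1.0,0.2,0
0.5,1.3,1
2.1,0.7,0
"""
train_df = pd.read_csv(StringIO(csv_text))
print(train_df.shape)
```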

2. Data Preprocessing

Contribution: Austin, Viraj
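The notebook does not spell out the preprocessing steps, so the following is only a generic sketch of two common ones, on a hypothetical frame: filling missing numeric values and one-hot encoding a categorical column.

```python
import pandas as pd

# Hypothetical frame standing in for the raw training data (assumption:
# numeric columns with gaps plus one categorical column).
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "city": ["A", "B", "A", "C"],
    "target": [0, 1, 0, 1],
})

# Fill numeric gaps with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# One-hot encode the categorical column.
df = pd.get_dummies(df, columns=["city"])
print(df.isna().sum().sum())
```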

3. Data Partition

Contribution: Viraj
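A minimal sketch of the partition step with scikit-learn's `train_test_split`. Synthetic data stands in for the competition features, and the 70/30 split ratio is an assumption (the actual ratio used is not stated here).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition data (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=123)

# 70/30 split (assumed ratio); stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=123)
print(X_train.shape, X_test.shape)
```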

4. Data Transformation

Normalize data

Contribution: Austin
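The normalization step might look like the following min-max scaling sketch (the exact scaler used is an assumption). The key point is that the scaler is fit on the training split only and then reused on the test split, so no test-set statistics leak into training.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Tiny illustrative arrays standing in for the real splits.
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[1.5, 300.0]])

# Fit on the training split only, then apply the same transform to the
# test split to avoid data leakage.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.min(), X_train_s.max())
```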

5. Model Training and Testing

1. k-Nearest Neighbors (k-NN)

Contribution: Austin, Viraj

1.1 Training a k-NN Classifier

Now we will use the test dataset to assess the performance of the trained model with k = 5.
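A sketch of the k = 5 baseline, with synthetic data standing in for the preprocessed competition features (assumption); the real notebook would use the splits produced earlier.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=123)

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 baseline
knn.fit(X_tr, y_tr)

# Assess on the held-out split with both metrics used in this notebook.
acc = accuracy_score(y_te, knn.predict(X_te))
auc = roc_auc_score(y_te, knn.predict_proba(X_te)[:, 1])
```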

1.2 Tuning K-NN classifier

Contribution: Viraj

We will tune the k hyper-parameter based on the accuracy score.

From the above result, we can see that k = 1 gives the highest accuracy, i.e., 89.75%.

Now we will further tune the k hyper-parameter based on the AUC score.

From the above result, we can observe that tuning the k hyper-parameter on both the accuracy and AUC scores gives k = 1 as the optimal value.

Using the test dataset to assess the performance of the trained model with k = 1.

From the above results, we can see that the k-NN classifier tuned to k = 1 performs best, with an AUC score of 87.59%, compared to the k = 5 baseline.
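The tuning step described above can be sketched with a cross-validated grid search over odd values of k. Synthetic data and the candidate range are assumptions; swapping `scoring="roc_auc"` for `scoring="accuracy"` gives the accuracy-based variant.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=123)

# Search odd k values 1..15, scoring each candidate by cross-validated AUC.
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": list(range(1, 16, 2))},
                    scoring="roc_auc", cv=5)
grid.fit(X_tr, y_tr)
best_k = grid.best_params_["n_neighbors"]
```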


2. Naive Bayes Classifier

2.1 Gaussian Naive Bayes Classifier

Contribution: Austin, comments by Viraj

Now we will use the test dataset to assess the performance of the trained model.
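A minimal sketch of the Gaussian Naive Bayes step on stand-in data (synthetic features are an assumption; the real notebook uses the competition splits).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=123)

# Gaussian NB models each feature as class-conditionally normal.
gnb = GaussianNB().fit(X_tr, y_tr)
auc = roc_auc_score(y_te, gnb.predict_proba(X_te)[:, 1])
```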

2.2 Bernoulli Naive Bayes Classifier

Contribution: Austin, Viraj

Assessing the performance of the trained model on the test dataset.

2.3 Multinomial Naive Bayes Classifier

Contribution: Austin

2.4 Complement Naive Bayes Classifier

Contribution: Austin

Comparing the performances of all Naive Bayes models

From the above result, we can conclude that the Multinomial Naive Bayes classifier gives the best AUC score, 91.13%, compared to the Gaussian, Bernoulli, and Complement Naive Bayes classifiers.
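The four-way comparison might be sketched as a single loop over the Naive Bayes variants. Synthetic data is an assumption; note that Multinomial and Complement NB require non-negative inputs, which is why the features are min-max scaled first.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, ComplementNB, GaussianNB, MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=123)

# Multinomial and Complement NB require non-negative features, so scale to [0, 1].
scaler = MinMaxScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te).clip(0, 1)  # guard values outside the train range

scores = {}
for name, model in [("gaussian", GaussianNB()), ("bernoulli", BernoulliNB()),
                    ("multinomial", MultinomialNB()), ("complement", ComplementNB())]:
    model.fit(X_tr_s, y_tr)
    scores[name] = roc_auc_score(y_te, model.predict_proba(X_te_s)[:, 1])
```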


3. Decision Tree Classifier

Contribution: Viraj


Let's use the test dataset to assess the performance of the model.

From the above DT classifier result, we can see that it gives an AUC score of 91.45% on the test dataset.
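The decision-tree step might look like the following sketch on stand-in data (synthetic features and default tree settings are assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=123)

# Fit a tree with default settings; fix the seed for reproducibility.
dt = DecisionTreeClassifier(random_state=123).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, dt.predict_proba(X_te)[:, 1])
```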


4. Random Forest Classifier

4.1 Random Forest with feature importance and without hyper-parameter tuning

Contribution: Viraj

We did not apply class weights to the RF classifier because doing so did not match or improve on the performance of the basic RF classifier.

Using the test dataset to assess the RF model.
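A sketch of the untuned random forest with its impurity-based feature importances, on synthetic stand-in data (assumption).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=123)

# Default forest (100 trees); fix the seed for reproducibility.
rf = RandomForestClassifier(random_state=123).fit(X_tr, y_tr)

# Impurity-based importances; they are normalized to sum to 1.
importances = rf.feature_importances_
```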

4.2 Random Forest with hyper-parameter tuning

Contribution: Viraj, with additional trials with trimming, balancing, etc by Austin

Training the RF model with the tuned hyper-parameters.

Using the test dataset to assess the RF model with tuned hyper-parameters.
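The hyper-parameter search might be sketched as a small cross-validated grid over the parameters named in the conclusion (criterion, max_features, n_estimators). The grid values and synthetic data below are assumptions kept small for illustration; the actual search reportedly went up to 400 trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=123)

# Illustrative grid (assumed values); scored by cross-validated AUC.
param_grid = {"criterion": ["gini", "entropy"],
              "max_features": [1, "sqrt"],
              "n_estimators": [50, 100]}
grid = GridSearchCV(RandomForestClassifier(random_state=123),
                    param_grid, scoring="roc_auc", cv=3)
grid.fit(X_tr, y_tr)
```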

4.3 Comparing both RF models

From the above result, we can see that the RF classifier with hyper-parameter tuning gives the best performance, with an AUC score of 95.79%.


5. Support Vector Machine

5.1 Linear SVC

Contribution: Viraj
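A sketch of the linear SVC on stand-in data (synthetic features and the standardization step are assumptions). Using `decision_function` gives a ranking score suitable for AUC without needing `probability=True`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=123)

# SVMs are scale-sensitive, so standardize using training statistics only.
scaler = StandardScaler().fit(X_tr)
svc = SVC(kernel="linear").fit(scaler.transform(X_tr), y_tr)

# decision_function scores rank the test points for the AUC computation.
auc = roc_auc_score(y_te, svc.decision_function(scaler.transform(X_te)))
```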

5.2 Tuning Hyper-parameters

Contribution: Viraj

5.3 Comparing both SVC models

From the above result, we can see that the SVC with a linear kernel and tuned hyper-parameters gives an AUC score of 94.48%.
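The SVC tuning step might look like the following grid search over the regularization strength C (the candidate values and synthetic data are assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=123)

# Tune C for the linear kernel, scoring candidates by cross-validated AUC.
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1.0, 10.0]},
                    scoring="roc_auc", cv=3)
grid.fit(X_tr, y_tr)
best_C = grid.best_params_["C"]
```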

6. Neural Network MLP Classifier

Contribution: Austin

From the above results, the NN MLP classifier achieves an AUC score of 95%.
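A sketch of the MLP step on stand-in data. The hidden-layer size, iteration cap, and synthetic features are assumptions; the inputs are standardized because MLPs train far better on scaled data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=123)

# Standardize first (training statistics only), then fit a small MLP.
scaler = StandardScaler().fit(X_tr)
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=123)
mlp.fit(scaler.transform(X_tr), y_tr)
auc = roc_auc_score(y_te, mlp.predict_proba(scaler.transform(X_te))[:, 1])
```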


Conclusion: Of all the models trained on the dataset, we choose the Random Forest classifier as the best supervised machine learning model, with the best hyper-parameters 'criterion': 'entropy', 'max_features': 1, 'n_estimators': 400, 'random_state': 123. It achieved an overall AUC score of 95.79%.


We choose the RF classifier as our final ML model and test its performance on the test dataset by importing the test_data.csv file in the summary Jupyter notebook.
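The final model with the hyper-parameters stated in the conclusion might be refit as follows. Synthetic data stands in here (assumption); in the summary notebook the forest would be fit on train_data.csv and applied to the imported test_data.csv.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=123)

# Hyper-parameters taken from the conclusion above.
final_rf = RandomForestClassifier(criterion="entropy", max_features=1,
                                  n_estimators=400, random_state=123)
final_rf.fit(X_tr, y_tr)
predictions = final_rf.predict(X_te)
```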